For the following exercises, we will again use data on population from Gapminder.

As usual, we first need to read in the data. You can copy the following code, paste it into your script, and run it.

library(readr)
library(dplyr)

gap_pop <- read_csv("../data/gapminder/population_total.csv") %>% 
  rename(country = "Total population")

Again, the data are currently in wide format.

1

Select only data for the 20th century, but this time use a helper function instead of specifying a range of columns.
The helper function you should use here is starts_with(). We also want to keep the country column.
gap_pop %>% 
  select(country, starts_with("19"))
## # A tibble: 275 x 56
##    country  `1900`  `1910`  `1920`  `1930`  `1940`  `1950`  `1951`  `1952`
##    <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Abkhaz~      NA      NA      NA      NA      NA      NA      NA      NA
##  2 Afghan~ 5021241 5351413 5813814 6394908 7034081 7752118 7839426 7934798
##  3 Akroti~      NA      NA      NA      NA      NA   10661   10737   10813
##  4 Albania  819950  901122  963956 1015991 1123210 1263171 1287499 1316086
##  5 Algeria 4946166 5404045 6063800 6876190 7797418 8872247 9039913 9216395
##  6 Americ~    5949    7047    8173   10081   13135   18937   19295   19543
##  7 Andorra    4393    4671    4974    5309    5667    6197    6692    7250
##  8 Angola  2898155 3136718 3387663 3642200 3920011 4354882 4439705 4529381
##  9 Anguil~    3561    3818    4097    4400    4725    5121    5297    5438
## 10 Antigu~   34925   32119   30000   33647   38495   46301   48306   49887
## # ... with 265 more rows, and 47 more variables: `1953` <dbl>,
## #   `1954` <dbl>, `1955` <dbl>, `1956` <dbl>, `1957` <dbl>, `1958` <dbl>,
## #   `1959` <dbl>, `1960` <dbl>, `1961` <dbl>, `1962` <dbl>, `1963` <dbl>,
## #   `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>,
## #   `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>,
## #   `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>,
## #   `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>,
## #   `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>,
## #   `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>,
## #   `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, `1998` <dbl>,
## #   `1999` <dbl>
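
Other selection helpers can express the same idea. The following is a minimal sketch on a made-up tibble (the object toy_wide and its values are invented for illustration and are not Gapminder data). It suggests that matches(), which takes a regular expression, picks the same columns as starts_with() when the pattern is anchored at the start of the name.

```r
library(dplyr)
library(tibble)

# Toy wide-format data; names and values are invented for illustration only
toy_wide <- tibble(
  country = c("A", "B"),
  `1890` = c(1, 2),
  `1900` = c(3, 4),
  `1999` = c(5, 6),
  `2000` = c(7, 8)
)

# starts_with() matches a literal prefix of the column name
sel_prefix <- toy_wide %>%
  select(country, starts_with("19"))

# matches() takes a regular expression; "^19" anchors the match at the start
sel_regex <- toy_wide %>%
  select(country, matches("^19"))

names(sel_prefix)
names(sel_regex)
```

For a plain prefix, starts_with() is the more readable choice; matches() becomes useful once the pattern is more complex than a prefix or suffix.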

As you may have already noticed, the dataset contains some missing data points. Before we start analyzing the data, we might want to know for how many countries we have complete data.

2

Using the dataset in wide format, find out for how many countries we have complete data.
To answer this question you should use the drop_na() function from tidyr.
library(tidyr)

gap_pop %>% 
  drop_na() %>% 
  nrow()
## [1] 229
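
To see what drop_na() does here, it can help to try it on a tiny example first. The sketch below uses an invented tibble (toy_wide is hypothetical, not part of the exercise data) and cross-checks the result with base R's complete.cases().

```r
library(dplyr)
library(tibble)
library(tidyr)

# Toy wide-format data with some missing values (invented for illustration)
toy_wide <- tibble(
  country = c("A", "B", "C"),
  `1900` = c(1, NA, 3),
  `1910` = c(4, 5, NA)
)

# Rows that survive drop_na() have no missing value in any column
n_complete <- toy_wide %>%
  drop_na() %>%
  nrow()

# Base R cross-check: complete.cases() flags rows without any NA
n_complete_base <- sum(complete.cases(toy_wide))

# Rows with at least one missing value
n_incomplete <- nrow(toy_wide) - n_complete

n_complete
n_incomplete
```

Because the data are in wide format, one row corresponds to one country, so the number of remaining rows is the number of countries with complete data.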

As in the previous set of data wrangling exercises, we now want to transform the data into the long format.

3

Transform the gap_pop dataset into a sensible long format. Name the variable representing the population values pop and store the resulting dataframe in an object with the same name as before (gap_pop). Also change the type of the year variable to integer.
This is just a repetition from the Tidy Data exercises. What we want to do is to gather the columns with the years into a year variable. To change the variable type, you need to use mutate().
gap_pop <- gap_pop %>% 
  gather(-country, key = "year", value = "pop") %>% 
  mutate(year = as.integer(year))
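
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a sketch under that version assumption, the same reshaping could look as follows on a small invented tibble (toy_wide and toy_long are hypothetical names used only for this illustration).

```r
library(dplyr)
library(tibble)
library(tidyr)

# Toy wide-format data (invented for illustration)
toy_wide <- tibble(
  country = c("A", "B"),
  `1900` = c(1, 2),
  `1910` = c(3, 4)
)

# pivot_longer() equivalent of gather(-country, key = "year", value = "pop")
toy_long <- toy_wide %>%
  pivot_longer(-country, names_to = "year", values_to = "pop") %>%
  mutate(year = as.integer(year))

toy_long
```

One visible difference: pivot_longer() keeps all rows of one country together, whereas gather() stacks the data column by column, so the row order of the two results differs even though the content is the same.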

Now let’s apply some of the advanced filtering options we discussed in the Data Wrangling - Part 2 session.

4

Create two new dataframes that include different subsets of the gap_pop data:

  1. Data for all countries for the 1990s (name this one gap_pop_1990s),

  2. Data for all years but only for Germany (name this one gap_pop_ger).

NB: There are different Germanies in the dataset: West Germany, East Germany, and Germany.
You need to use a helper function from dplyr to create the first new data frame and a specific matching operator to create the second one.
gap_pop_1990s <- gap_pop %>% 
  filter(between(year, 1990, 1999))

gap_pop_1990s
## # A tibble: 2,750 x 3
##    country                year      pop
##    <chr>                 <int>    <dbl>
##  1 Abkhazia               1990       NA
##  2 Afghanistan            1990 12067570
##  3 Akrotiri and Dhekelia  1990    14127
##  4 Albania                1990  3281453
##  5 Algeria                1990 25912364
##  6 American Samoa         1990    47044
##  7 Andorra                1990    54511
##  8 Angola                 1990 11127870
##  9 Anguilla               1990     8334
## 10 Antigua and Barbuda    1990    61906
## # ... with 2,740 more rows
gap_pop_ger <- gap_pop %>% 
  filter(country %in% 
           c("Germany", "West Germany", "East Germany"))

gap_pop_ger
## # A tibble: 243 x 3
##    country       year      pop
##    <chr>        <int>    <dbl>
##  1 East Germany  1800       NA
##  2 Germany       1800 22886919
##  3 West Germany  1800       NA
##  4 East Germany  1810       NA
##  5 Germany       1810 23882461
##  6 West Germany  1810       NA
##  7 East Germany  1820       NA
##  8 Germany       1820 25507768
##  9 West Germany  1820       NA
## 10 East Germany  1830       NA
## # ... with 233 more rows
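
Both helpers used above are shorthand for longer comparisons. The sketch below, on an invented tibble (toy_long and its values are hypothetical), illustrates that between() includes both boundaries and that %in% behaves like a chain of == comparisons joined with |.

```r
library(dplyr)
library(tibble)

# Toy long-format data (invented for illustration)
toy_long <- tibble(
  country = c("Germany", "France", "East Germany", "Germany"),
  year = c(1989L, 1990L, 1995L, 2000L),
  pop = c(1, 2, 3, 4)
)

# between() is inclusive on both ends ...
by_between <- toy_long %>%
  filter(between(year, 1990, 1999))

# ... so it gives the same rows as two explicit comparisons
by_compare <- toy_long %>%
  filter(year >= 1990, year <= 1999)

# %in% checks membership in a set of values ...
by_in <- toy_long %>%
  filter(country %in% c("Germany", "East Germany"))

# ... which is equivalent to combining == comparisons with |
by_or <- toy_long %>%
  filter(country == "Germany" | country == "East Germany")
```

Note that writing country == c("Germany", "East Germany") is not a substitute for %in%: the comparison is done element by element with recycling, so it can silently return the wrong rows.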

For some comparisons (especially via plots), it might help to know which continent the country is located on. For this purpose, we will create a new continent variable. As it would be quite tedious to create this variable manually for all of the countries in the dataset, we will do this only for a subset in this exercise. Just run the following code in your local script to create this subset.

gap_pop_subset <- gap_pop %>% 
  filter(country %in% 
           c("Netherlands", "Brazil", "China", "Algeria", "New Zealand"))

5

Create a continent variable for the countries in gap_pop_subset. The variable should be a factor and its values the following: Africa, Americas, Asia, Europe, Oceania.
You can use recode_factor() to create the new variable. Alternatively, you could use case_when() here. However, the latter requires more typing, which we generally want to avoid.
gap_pop_subset %>% 
  mutate(continent = recode_factor(country,
                                   "Algeria" = "Africa",
                                   "Brazil" = "Americas",
                                   "China" = "Asia",
                                   "Netherlands" = "Europe",
                                   "New Zealand" = "Oceania"))
## # A tibble: 405 x 4
##    country      year       pop continent
##    <chr>       <int>     <dbl> <fct>
##  1 Algeria      1800   2503218 Africa
##  2 Brazil       1800   3639636 Americas
##  3 China        1800 321675013 Asia
##  4 Netherlands  1800   2254522 Europe
##  5 New Zealand  1800    100000 Oceania
##  6 Algeria      1810   2595056 Africa
##  7 Brazil       1810   4058652 Americas
##  8 China        1810 350542958 Asia
##  9 Netherlands  1810   2293548 Europe
## 10 New Zealand  1810    100000 Oceania
## # ... with 395 more rows
# alternative solution using case_when
# gap_pop_subset %>% 
#  mutate(continent = factor(case_when(
#    country == "Algeria" ~ "Africa",
#    country == "Brazil" ~ "Americas",
#    country == "China" ~ "Asia",
#    country == "Netherlands" ~ "Europe",
#    country == "New Zealand" ~ "Oceania")
#    ))
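
After recoding, a quick sanity check is to inspect the levels of the new factor. With recode_factor(), a value that has no replacement is kept as it is and ends up as a level of its own; with case_when(), an unmatched value becomes NA. The following sketch uses a small invented tibble (toy_subset, toy_recoded, and expected_levels are hypothetical names) to show one way such a check could look.

```r
library(dplyr)
library(tibble)

# One row per country is enough to check the mapping (invented for illustration)
toy_subset <- tibble(
  country = c("Algeria", "Brazil", "China", "Netherlands", "New Zealand")
)

# The continent names required by the exercise
expected_levels <- c("Africa", "Americas", "Asia", "Europe", "Oceania")

toy_recoded <- toy_subset %>%
  mutate(continent = recode_factor(country,
                                   "Algeria" = "Africa",
                                   "Brazil" = "Americas",
                                   "China" = "Asia",
                                   "Netherlands" = "Europe",
                                   "New Zealand" = "Oceania"))

# Levels that are not valid continent names point to an unmapped country
unexpected <- setdiff(levels(toy_recoded$continent), expected_levels)

# Missing values would point to unmatched cases in a case_when() solution
n_unmapped <- sum(is.na(toy_recoded$continent))

levels(toy_recoded$continent)
unexpected
n_unmapped
```

If unexpected is empty and n_unmapped is zero, every country in the subset has been assigned one of the five continents.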